{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 前処理(bag of words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np # linear algebra\n",
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__bag of words__<br>\n",
    "特徴量として単語の出現回数を配列化<br>\n",
    "sklearn.feature_extraction.text.countvectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "①lower case<br>\n",
    "→sklearn：countvectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ai is our friend and it has been friendly'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "'AI is our friend and it has been friendly'.lower()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "②lematization or stemming<br>\n",
    "単語をそろえてくれるツール\n",
    "以下は性能が低そう。\n",
    "from nltk.stem import WordNetLemmatizer\n",
    "lemmatizer = WordNetLemmatizer() \n",
    "lemmatizer.lemmatize(\"bad?)\")\n",
    "\n",
    "stemmer = nltk.PorterStemmer()\n",
    "stemmer.stem(\"anymore!\t\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "③stopwords<br>\n",
    "冠詞、前置詞など推測に関係なさそうなものを取り除く。<br>\n",
    "→NLTK"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 前処理(Ngrams)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "xxxx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 正規化"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "①tf<br>\n",
    "単語数で割る。(一文ごとの正規化)<br>\n",
    "②idf<br>\n",
    "単語の希少性<br>\n",
    "log(文書の総数/単語tを含む文字数)<br>\n",
    "__今回のコンペでは文書は独立していると考えたい→②は控える__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# メトリック：Jacord similarity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_jaccard_sim(str1, str2): \n",
    "    a = set(str1.split()) \n",
    "    b = set(str2.split())\n",
    "    c = a.intersection(b)\n",
    "    return float(len(c)) / (len(a) + len(b) - len(c))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Jaccard similarity of above two sentences is 0.33333333333333337\n"
     ]
    }
   ],
   "source": [
    "import nltk \n",
    "w1 = set('AI is our friend and it has been friendly'.lower().split())\n",
    "w2 = set('AI and humans have always been friendly'.lower().split())\n",
    " \n",
    "print (\"Jaccard similarity of above two sentences is\",1-nltk.jaccard_distance(w1, w2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'ai', 'and', 'been', 'friend', 'friendly', 'has', 'is', 'it', 'our'}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "w1 = set('AI is our friend and it has been friendly'.lower().split())\n",
    "w1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 可視化\n",
    "wordcloud<br>\n",
    "文章中で出現頻度が高い単語を複数選び出し、その頻度に応じた大きさで図示する手法。 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "keras：転移学習モデル __BERT__<br>\n",
    "import transformers<br>\n",
    "tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-uncased')  ## change it to commit<br>\n",
    "from tokenizers import BertWordPieceTokenizer<br>\n",
    "Reload it with the huggingface tokenizers library<br>\n",
    "fast_tokenizer = BertWordPieceTokenizer('distilbert_base_uncased/vocab.txt', lowercase=True)<br>\n",
    "complete-eda-bert-lstmに記載あり。<br>\n",
    "__キーワード：transformers, tokenizers, BertWordPieceTokenizer__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "from tensorflow.keras.layers import Dense, Input\n",
    "from tensorflow.keras.optimizers import Adam\n",
    "from tensorflow.keras.models import Model\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation\n",
    "from tensorflow.keras.layers import Bidirectional, GlobalMaxPool1D,Bidirectional\n",
    "from tensorflow.keras.models import Model\n",
    "from tensorflow.keras.layers import TimeDistributed\n",
    "from tensorflow.keras.layers import concatenate\n",
    "from tensorflow.compat.v1.keras.layers import CuDNNLSTM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tokenizers import BertWordPieceTokenizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import transformers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# discussionのネタ"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "xxxx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Submit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "①complete-eda-baseline-modelをコピー(__baseline__)<br>\n",
    "　0.397\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 次回作業\n",
    "・kernel投稿<br>\n",
    "　→neutralはそのままにしてみる②<br>\n",
    "・ローカルで評価ができる実装構築<br>\n",
    "・sentimentの重みを大きくするように模索&(sentimentは良し悪しを表すので何か表現できそう)③<br>\n",
    "・学習モデルエポック数最適値④"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 優先\n",
    "・一つも削除しないまま原文を予測として出している。<br>\n",
    "→create_targets, convert_outputの理解<br>\n",
    "\n",
    "__改善しようがなければ別のモデルを使うしかなさそう。__\n",
    "→\n",
    "DistilBERT-QA Starter (+ cross-validation) tweaked"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
